In the last several years, tightly coupled PC clusters have become widely applied, cost effective resources for 
lattice gauge computations. This paper discusses the practice of building such clusters, in particular balanced 
design requirements. I review and quantify the improvements over time of key performance parameters and 
overall price to performance ratio. Applying these trends and technology forecasts given by computer equipment 
manufacturers, I predict the range of price to performance for lattice codes expected in the next several years. 



1. INTRODUCTION 

The simulation codes of lattice gauge theory 
require substantial computing resources in order 
to calculate various matrix elements with suffi- 
cient precision to test the Standard Model against 
emerging experimental measurements. Histori- 
cally, these codes have demanded the use of large 
supercomputers at significant cost. Both general 
purpose commercial supercomputers and custom, 
or "purpose-built", supercomputers have been 
employed. Traditional supercomputers came 
with very high prices. The price of purpose-built 
supercomputer hardware was lower, but the de- 
sign and construction of such machines required 
significant amounts of engineering and physicist 
manpower. 

In the last half decade, the performance of com- 
modity computing equipment has increased to the 
point that tightly coupled clusters of such ma- 
chines can compete with traditional supercom- 
puters in capacity (lattice size) and throughput 
(MFlop/sec), and with purpose-built supercom- 
puters in price/performance. Commodity sys- 
tems have been so successful across a wide spec- 
trum of applications in many academic fields, 
that more than half of the supercomputers listed 
on the "Top500" [1] supercomputer list are
clusters.

In this paper, I discuss the requirements placed 
on clusters by lattice QCD codes and the histor- 
ical performance trends of commodity comput- 
ing equipment for meeting those requirements. 



Extrapolating from these trends, together with 
vendor roadmaps, allows prediction of the perfor- 
mance and price/performance of reasonable clus- 
ter designs in the next few years. 

2. DESIGNING BALANCED SYSTEMS 

Inversion of the Dirac operator (Dslash) is the 
most computationally intensive task of lattice 
codes. The improved staggered action (asqtad) 
will be used throughout this paper for quantita- 
tive examples. During each iteration of the in- 
version of the improved staggered Dslash, eight 
sets of SU(3) matrix-vector multiplies occur using
nearest and next-next-nearest neighbor spinors. 
When domain decomposition is used on a clus- 
ter, ideally these floating point operations overlap 
with the communication of the hyper-surfaces of 
the sub-lattices held on neighboring nodes. Using 
global sums, the results of these sweeps over the 
full lattice are accumulated and communicated to 
all nodes in order to modify the spinors for the 
next iteration. 

Dslash inversion throughput depends upon the 
floating point performance of the processors, the 
bandwidth available for reading operands from 
memory, the throughput of the I/O bus of the 
cluster nodes, and the bandwidth and latency of 
the network fabric connecting the computers. On 
any cluster, one of these factors will be the limit- 
ing factor which dictates performance for a given 
problem size. Minimization of price/performance 
requires designs which balance these factors. 

2.1. Floating Point Performance

Most floating point operations in lattice codes 
occur during SU(3) matrix-vector multiplies. For 
operands in cache, the throughput of these multi- 
plies is dictated by processor clock speed and the 
capabilities of the floating point unit. Table 1
shows the performance of matrix-vector kernels 
on four Intel processors introduced since the year 
2000. The "C" language kernels used are from the 
MILC [2] code. The use of SIMD instructions on
Intel-brand and compatible CPUs, as suggested 
by Csikor et al. for AMD K6-2 CPUs and implemented
for the Intel SSE unit by Lüscher [4],
can give significant performance improvements. 
Table 1 lists the performance of two styles of SSE
implementation. The first, site wise, uses a con- 
ventional data layout scheme with the real and 
imaginary pieces of individual matrix and vec- 
tor elements adjacent in memory. The second, 
fully vectorized, follows Pochinsky's [5] practice
of placing the real components of the operands 
belonging to four consecutive lattice sites consec- 
utively in memory, followed by the four imagi- 
nary components. Whereas site wise implemen- 
tations require considerable shuffling of operands 
in the SSE registers in order to perform complex 
multiplies, the fully vectorized form requires only 
loads, stores, multiplies, additions, and subtrac- 
tions. 
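
The two data layouts can be illustrated with a short sketch. The function names and the use of Python lists are illustrative assumptions and are not taken from the MILC code or from the SSE implementations cited above; the sketch only shows how one complex component per lattice site is arranged in memory, and why the fully vectorized form needs no shuffling.

```python
def site_wise_layout(values):
    """Interleave real and imaginary parts: re0, im0, re1, im1, ..."""
    flat = []
    for z in values:
        flat.extend([z.real, z.imag])
    return flat


def fully_vectorized_layout(values, width=4):
    """Group `width` consecutive sites: their reals, then their imaginaries."""
    if len(values) % width != 0:
        raise ValueError("number of sites must be a multiple of width")
    flat = []
    for start in range(0, len(values), width):
        block = values[start:start + width]
        flat.extend(z.real for z in block)
        flat.extend(z.imag for z in block)
    return flat


def vectorized_complex_multiply(a, b, width=4):
    """Multiply two arrays stored in the fully vectorized layout.

    Each block needs only elementwise multiplies, additions and
    subtractions on groups of `width` numbers; no reordering is needed.
    """
    out = []
    for start in range(0, len(a), 2 * width):
        ar = a[start:start + width]
        ai = a[start + width:start + 2 * width]
        br = b[start:start + width]
        bi = b[start + width:start + 2 * width]
        out.extend(x * u - y * v for x, y, u, v in zip(ar, ai, br, bi))
        out.extend(x * v + y * u for x, y, u, v in zip(ar, ai, br, bi))
    return out
```

In the site wise layout, the real and imaginary parts of one number sit next to each other, so a four-wide SIMD register holds two complex numbers and the cross terms of a complex product require rearranging the register contents. In the fully vectorized layout, each register holds the same component of four different sites.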

2.2. Memory Performance 

The bandwidth of access to main memory by 
processors depends upon the width and the clock 
speed of the data bus. Intel and compatible ia32 
architecture processors use 64-bit data buses ex- 
clusively. The effective speed of the so-called front 
side bus, or FSB, has increased from 66 MHz in 
the mid-90's, to 800 MHz today. The corresponding
peak memory bandwidths have increased from
528 MB/sec to 6400 MB/sec. According to Intel
roadmaps, processors with 1066 MHz FSB and
8530 MB/sec peak bandwidths will be available
by November of 2004. The doubling time for the
exponential fit to these bandwidths is 1.87 years.
The doubling time for achievable bandwidth, measured
using the STREAMS benchmark, is 1.71 years.
With SSE optimizations, the achieved doubling
time decreases to 1.49 years.

Table 2
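
The peak bandwidths quoted in this section follow from the bus width and the effective bus clock. The sketch below reproduces that arithmetic; the function names are illustrative. The two-point doubling-time helper is an assumption added for illustration only and is not the exponential fit behind the 1.87 year figure, which uses data points not reproduced here.

```python
import math


def peak_bandwidth_mb_per_sec(bus_width_bits, effective_clock_mhz):
    """Peak bandwidth of a parallel bus: bytes per transfer times MHz."""
    return bus_width_bits / 8 * effective_clock_mhz


def two_point_doubling_time(value_start, value_end, elapsed_years):
    """Doubling time implied by exponential growth between two values."""
    return elapsed_years * math.log(2) / math.log(value_end / value_start)
```

For a 64-bit bus, effective clocks of 66, 800, and 1066 MHz give 528, 6400, and about 8530 MB/sec respectively.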

From memory bandwidth measurements, us- 
ing tools such as STREAMS, an estimate of the 
throughput of SU(3) matrix-vector multiply ker- 
nels can be made in the case in which all operands 
come from main memory, typical for lattice QCD 
codes. For single precision calculations, each 
matrix-vector multiply requires 96 input bytes, 
24 output bytes, and 66 floating point operations. 
The throughput is given by this flop count divided
by the time needed to move the operands, that is,
the input bytes divided by the read rate plus the
output bytes divided by the write rate. Table 2
shows the main memory matrix-vector throughput
for six generations of ia32 processor, along
with the conventional and SSE assisted read and
write rates. Comparing Table 2 to Table 1 clearly
shows that memory bandwidth constrains lattice
QCD code performance.
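
The memory-bound estimate described above can be sketched as follows. The byte and flop counts are those given in the text for a single precision matrix-vector multiply; the function name is illustrative, and the rates passed in are whatever a tool such as STREAMS reports for the machine in question.

```python
INPUT_BYTES = 96    # 3x3 complex matrix (72 bytes) plus 3-component vector (24 bytes)
OUTPUT_BYTES = 24   # resulting 3-component complex vector
FLOPS = 66          # 9 complex multiplies (6 flops each) + 6 complex adds (2 each)


def matvec_mflops(read_mb_per_sec, write_mb_per_sec):
    """Estimated MFlop/sec when every operand comes from main memory.

    Rates in MB/sec equal bytes per microsecond, so the time below is in
    microseconds and flops per microsecond equals MFlop/sec.
    """
    time_us = INPUT_BYTES / read_mb_per_sec + OUTPUT_BYTES / write_mb_per_sec
    return FLOPS / time_us
```

Even at the 6400 MB/sec peak rate for both reads and writes, which measured rates do not reach, this estimate gives 3520 MFlop/sec.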

2.3. Communications Requirements and I/O Bus Performance

Gottlieb has presented a very useful model 
for understanding the communications require- 
ments of lattice QCD code. Modified for this 
paper for the improved staggered action, this 
model assumes a hypercubic lattice evenly divided
among the nodes of a cluster, with inter-node
communications occurring along all 4 directions.
The size of the sub-lattice stored on
each node, along with the requirement of the al- 
gorithm that data from the three outermost hy- 
perplanes in each direction be communicated be- 
tween neighboring nodes, gives the size of the 
messages interchanged during each iteration of 
the Dslash inverter. Therefore, for any assumed 
Dslash performance and sub-lattice size, one can 
easily determine the necessary communications
bandwidth. The required bandwidth increases 
with decreasing sub-lattice size, and increases 
with increasing Dslash throughput. 
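
A minimal sketch of this kind of estimate is given below. It is not the exact formulation of the model: the assumption that each boundary site contributes one single precision color vector (24 bytes), the assumption that communication must finish within the compute time of one Dslash application, and the `flops_per_site` parameter are all illustrative choices that the caller must supply or accept.

```python
def message_bytes(sublattice_extent, hyperplanes=3, bytes_per_site=24):
    """Bytes sent to one neighbor: `hyperplanes` slices of L^3 sites each."""
    return hyperplanes * sublattice_extent ** 3 * bytes_per_site


def required_bandwidth_mb_per_sec(sublattice_extent, dslash_mflops,
                                  flops_per_site, directions=4):
    """Bandwidth needed so communication fits within the compute time.

    dslash_mflops is the assumed per-node Dslash throughput; time is in
    microseconds, so bytes per microsecond equals MB/sec.
    """
    sites = sublattice_extent ** 4
    compute_time_us = sites * flops_per_site / dslash_mflops
    # One message in each of the two senses of every direction.
    total_bytes = 2 * directions * message_bytes(sublattice_extent)
    return total_bytes / compute_time_us
```

Because the message volume scales as L^3 while the work scales as L^4, the required bandwidth in this sketch is proportional to the Dslash throughput divided by L, matching the two trends stated above.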

The maximum communications rate between 
nodes in a cluster is limited by the smaller of 
the I/O bus and network bandwidths. Figure 1
shows the required bandwidths from the model 
as a function of message size for a variety of as- 
sumed Dslash throughputs. The labeled horizon- 
tal lines show the burst communications rates of 
various I/O buses, from the 132 MB/sec rate for 
the 32-bit, 33 MHz PCI bus of the mid-90's, to 
the 2000 MB/sec rate for the four-lane PCI Ex- 
press (PCI-E) introduced in 2004. For any of the 
I/O architectures shown, the achievable commu- 
nications rate will be no more than perhaps 75% 
of these burst rates. This plot shows that for cur- 
rent processors, capable of achieving 800 to 1600 
MFlop Dslash throughput, PCI-X (64-bit, 133 
MHz) buses are sufficient. Furthermore, currently 
available sixteen lane PCI-E will be more than 
sufficient for at least six more years, when proces- 
sors could achieve at least 10 GFlop throughput. 
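
The comparison between a required bandwidth and the available I/O buses can be sketched as below. The burst rates are those quoted in the text, and the 75% derating is the rough upper limit stated above; the dictionary keys and function names are illustrative.

```python
BURST_RATES_MB_PER_SEC = {
    "PCI 32-bit 33 MHz": 132,
    "PCI 64-bit 66 MHz": 528,
    "PCI-X 64-bit 133 MHz": 1064,
    "PCI-E 4X": 2000,
}


def parallel_bus_burst_mb_per_sec(width_bits, clock_mhz):
    """Burst rate of a parallel bus: bytes per transfer times clock in MHz."""
    return width_bits // 8 * clock_mhz


def sufficient_buses(required_mb_per_sec, efficiency=0.75):
    """Buses whose derated burst rate covers the required bandwidth."""
    return [name for name, burst in BURST_RATES_MB_PER_SEC.items()
            if burst * efficiency >= required_mb_per_sec]
```

For example, a requirement of 500 MB/sec excludes the 64-bit, 66 MHz PCI bus, whose derated rate is 396 MB/sec, but is met by PCI-X and four-lane PCI-E.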

2.4. Network Fabric Performance 

A number of network fabrics exist with suffi- 
cient performance for lattice QCD clusters. These 
include gigabit ethernet employing switches or 
toroidal meshes [8,9], Myrinet, Quadrics, SCI,
and Infiniband. Gigabit ethernet meshes of high 
dimensionality have the advantage of very low 
cost, but the disadvantages of large numbers of 
cables, the need for custom software, and sen- 
sitivity to node failures. SCI, another multi- 
dimensional toroidal mesh, is robust against node 
failures but at higher cost than gigabit ethernet.
Myrinet, Quadrics, and Infiniband all employ
switched fabrics and have been used in large
(order 1000) node clusters in fields outside of
lattice QCD.

Figure 1. The diagonal lines show the required
communication bandwidths as a function
of message size and assumed Dslash performance
(MFlop/sec). The marked message sizes, L,
correspond to sub-lattices of size L^4. The horizontal
lines show burst rates for these buses: PCI 32-bit,
33 MHz (132 MB/sec), PCI 64-bit, 66 MHz (528
MB/sec), PCI-X 64-bit, 133 MHz (1064 MB/sec),
and 4X PCI-E (2000 MB/sec).

Examples of communications performance for 
Myrinet (LANai9, PCI64B) and Infiniband (PCI-X)
networks are shown in Fig. 2. Typical for all
fabrics is the bandwidth saturation at large mes- 
sage sizes, limited by either the I/O bus or the 
network itself, and the steady decrease in band- 
width with decreasing message size because of the 
delay (latency) necessary to set up and process a
communication. Dslash inversion usually involves 
message sizes of order 1000 bytes or higher. The 
dispersion of bandwidth with message size deter- 
mines how small a sub-lattice may be employed. 
For a fixed problem size, increasing the number of 
nodes decreases the time required to perform the 
calculation when the parallel computer is limited 
by floating point performance or memory band- 
width. However, since bandwidth also declines 
with the smaller message sizes, as the number of 
nodes increases eventually the network will become
the limiting performance factor.

Figure 2. Network bandwidth as a function of
message size, measured using the MPI Netpipe
benchmark.
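
The shape of such dispersion curves can be approximated with a simple latency plus transfer-time model. This is a common simplification and an assumption here, not the measured Netpipe data; the function names are illustrative, and the latency and peak values must be taken from measurements of the fabric in question.

```python
def effective_bandwidth_mb_per_sec(message_bytes, latency_us, peak_mb_per_sec):
    """Bandwidth seen by one message: size over (latency + size / peak).

    With peak in MB/sec (bytes per microsecond) and latency in
    microseconds, the result is again in MB/sec.
    """
    transfer_time_us = latency_us + message_bytes / peak_mb_per_sec
    return message_bytes / transfer_time_us


def half_bandwidth_message_bytes(latency_us, peak_mb_per_sec):
    """Message size at which half of the peak bandwidth is reached."""
    return latency_us * peak_mb_per_sec
```

In this model the bandwidth saturates at the peak value for large messages and falls off linearly with message size once the latency term dominates.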

A rough estimate of this cutoff for a given 
sub-lattice size may be obtained by superimpos- 
ing the network bandwidth dispersion curve onto 
the model curves of the last section. See Fig. 3
for an example using Myrinet, where the dispersion
curve was obtained using a two-node MPI
Netpipe [10] benchmark. For this network, message
sizes of at least 10^4 bytes are required for
800 MFlop Dslash throughput. Note, however, 
that this cutoff estimate is an optimistic upper 
bound. Unlike Netpipe, there is contention for 
both the I/O and memory buses when lattice 
QCD code runs. I/O bus contention results from 
in-bound and out-bound messages occurring si- 
multaneously. Competition for the memory bus 
results from the overlap of communications with 
computation. 

As with floating point performance, memory 
bandwidth, and I/O bus capability, network fab- 
rics have steadily improved in performance over 
the last decade. Ethernet speeds have increased 
from 10 Mbit/sec in the early '90s to 10 Gbit/sec. 
Infiniband, the newest network fabric, currently
is available with 8 Gbit/sec bandwidth in each
direction, with 24 Gbit/sec expected in 2005. Such
high bandwidth networks raise the network dispersion
curve of Fig. 3 sufficiently to support
many forthcoming generations of processors. 

Figure 3. The Myrinet dispersion curve of Fig. 2
is superimposed on the communications model of
Fig. 1 in order to estimate network limited
performance as a function of processor performance
and sub-lattice size.



3. OPTIMIZING PRICE/PERFORMANCE 

Many factors must be taken into consid- 
eration when building clusters to optimize 
price/performance. As discussed above, either 
floating point performance, memory bandwidth, 
or communications performance will be the limit- 
ing factor for throughput. It makes little sense to 
spend additional funds for faster nodes or larger 
clusters if the network fabric limits performance. 
On the other hand, an investment in network 
hardware with excess bandwidth can be very cost 
effective, as the fabric may be reused when nodes 
are upgraded or replaced. 

3.1. Node Costs 

At the present time, low cost commodity com- 
puters are available with either one or two pro- 
cessors. Computers with more than two proces- 
sors exist, but are significantly more expensive 
per processor. Since network interface cards rep- 
resent a significant fraction of the total cost of 
a tightly coupled cluster, minimizing the num- 
ber of interfaces by using dual processor systems 
can greatly lower overall costs. Also, dual CPU 
systems lower the labor costs for building and
administering clusters. On the other hand, single
processor systems as a rule have greater memory
bandwidth per processor. 

Over the last several years, the most cost effec- 
tive computers for single node computations have 
been single processor machines. At any given 
time, the highest front side bus speeds, 800 MHz 
currently and soon 1066 MHz, have been available 
only on single processor systems. Furthermore, 
these systems are sold in huge volumes as busi- 
ness desktops and home machines, driving prices 
down. Their use in clusters, however, was ques- 
tionable before 2004 because none of these sys- 
tems had fast PCI I/O buses. In 2004, systems 
with PCI-X and PCI Express buses entered the 
market. In Fermilab's May 2004 purchase of 130 
single Pentium 4E systems, the cost per node was 
approximately $900 for systems with server-class 
motherboards, 2.8 GHz processors, 1 GByte of 
system memory, and PCI-X I/O. 

Dual processor systems generally cost less than 
two times the price of corresponding single pro- 
cessor computers. Systems based on Intel ia32 
processors have shared memory buses; the proces- 
sors in these systems compete with each other for 
memory bandwidth, and as a consequence SMP 
scaling on lattice codes is poor. However, these 
systems tend to have very capable PCI I/O buses. 
The correct approach when high performance I/O 
is required is to purchase dual-capable systems 
populated with only a single processor. 

Since mid-2003, dual-processor systems based 
on AMD's Opteron processor have been available. 
These systems include a memory controller em- 
bedded in each processor as well as distinct local 
memory buses attached to each CPU. Access from 
an Opteron processor to memory attached to the 
other CPU is considerably slower than access to 
the local memory. Optimizing lattice codes on 
these computers requires modifications to the op- 
erating system and user code to take into account 
the non-uniform memory architecture. 

3.2. Network Costs 

As a rule, the cost of a high performance network
fabric is at least half of, and often equal to,
the cost of the computing nodes. Furthermore,
distinct jumps in the cost per node of network
fabrics occur as clusters grow in node count
beyond the size of the largest available switch.
Larger clusters require cascading of switches, with 
a correspondingly higher cost per switch port. 
Typical costs for non-cascaded switched fabrics 
based on Myrinet or Infiniband are approximately 
$1000 per node, including the switch, cabling, 
and network interface card. The largest Myrinet 
switch available at present has 256 ports. The 
largest Infiniband switch has 288 ports. 

Lattice QCD clusters with gigabit ethernet 
mesh fabrics typically have six or more ethernet 
ports per node, with each port connected directly 
to a neighboring node. Dual port interfaces are 
available for approximately $150 each. The lower 
cost of these meshes must be balanced against 
larger cable plants, the need for custom communi- 
cations software, and the sensitivity of the cluster 
to node failures. 

4. HISTORICAL TRENDS AND PREDICTIONS

Figure 4 shows the price/performance of MILC
improved staggered code for five clusters built 
since late 1998, an estimate of price/performance 
for the new Fermilab cluster currently being com- 
missioned, and predictions of price/performance 
for clusters to be built in the next three years. 
The oldest cluster shown utilized Pentium II pro- 
cessors with 100 MHz memory buses. The newest 
existing cluster uses Pentium 4E processors with 
800 MHz FSB. From the fit to the existing clus- 
ter data, the halving time for price/performance 
is 1.25 years. 
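
A fitted halving time can be turned into a simple extrapolation, sketched below. The function name is illustrative, and the assumption that the historical exponential trend continues unchanged is exactly that, an assumption; the predictions later in this section rest on specific hardware expectations rather than on this formula alone.

```python
def projected_price_performance(current_dollars_per_mflop, years_ahead,
                                halving_time_years=1.25):
    """Extrapolate $/MFlop assuming a constant halving time."""
    return current_dollars_per_mflop * 0.5 ** (years_ahead / halving_time_years)
```

With a 1.25 year halving time, price/performance improves by a factor of four every 2.5 years.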

Given the historical performance trends, 
along with vendor roadmaps, we can at- 
tempt predictions of future lattice QCD cluster 
price/performance. These predictions are based 
upon the following assumptions: 

• Intel ia32 processors will be available at 4.0 
GHz and 1066 MHz FSB in 2005. 

• Processors will be available either singly at 
5.0 GHz, or in dual core equivalence (e.g., 
dual core 4.0 GHz processors) in 2006. 

• Equivalent memory bus speed will exceed 
1066 MHz by 2006 through fully buffered 
DIMM technology or other advances. 

• The cost of high performance networks such 
as Infiniband will drop as these networks 
increase in sales volume and the network 
interfaces are embedded on motherboards. 

Table 3. Price/Performance Predictions. Performance units are GFlop/sec per node. 

Figure 4. Price/performance measurements and 
predictions for Intel ia32 clusters. Shown are 
measured (1998 - 2003) and estimated (2004) 
price/performance values for clusters at Sandia 
National Laboratory, Indiana University, and 
Fermilab, as well as predicted price/performance 
values for clusters to be built in late 2004, 2005, and 
2006. 

The predictions assume that several new technologies
are delayed by one year from their first
appearance on current vendor roadmaps. For example,
vendor roadmaps predict that 1066 MHz
memory buses will appear in 2004, dual core
processors in 2005, and fully buffered DIMM
technology in 2005. By year, the details of the
predicted values in Fig. 4, also summarized in
Table 3, are as follows. In mid-2004, the latest
Fermilab cluster used 2.8 GHz P4E systems at
$900/node. The measured sustained performance 
of this cluster varies from approximately 900 to 
1100 MFlop/node, depending upon lattice lay- 
out (i.e., the number of directions of communi- 
cations). A Myrinet fabric from an older cluster 
was reused; this fabric has an estimated replace- 
ment cost of $900 per node. In late 2004, a cluster 
based on 3.4 GHz P4E processors with PCI-E and 
Infiniband would sustain 1.4 GFlop/node, based 
on the faster processors and the improved com- 
munications. In late 2005, a cluster based on 
4.0 GHz processors with 1066 MHz FSB would 
sustain 1.9 GFlop/node, based upon faster pro- 
cessors and higher memory bandwidth. In late 
2006, a cluster based on the equivalent of 5.0 GHz 
processors with memory bandwidth greater than 
1066 MHz FSB would sustain 3.0 GFlop/node. 
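
The price/performance of a given configuration follows from the per-node hardware cost and the sustained throughput. The sketch below uses the mid-2004 figures quoted above; the function names are illustrative, and treating the Myrinet replacement cost as part of the per-node cost is an assumption made for this example.

```python
def dollars_per_mflop(node_cost, network_cost_per_node, sustained_mflops_per_node):
    """Price/performance of one node including its share of the network."""
    return (node_cost + network_cost_per_node) / sustained_mflops_per_node


def max_cost_per_node(target_dollars_per_mflop, sustained_mflops_per_node):
    """Largest per-node cost (node plus network) that meets a target."""
    return target_dollars_per_mflop * sustained_mflops_per_node
```

With $900 nodes and a $900 per node fabric, sustained rates of 900 to 1100 MFlop/node correspond to roughly $2.00 to $1.64 per MFlop.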

5. LIMITS TO PRACTICAL CLUSTER SIZE

The network fabrics used on clusters limit both 
achievable performance and cost effectiveness. As 
discussed previously, the largest single high per- 
formance network switches currently available are 
288-port Infiniband switches. To build a larger 
cluster based on such a switched network, cas- 
cading of multiple switches is required. To pre- 
serve bisectional bandwidth through the fabric, 
switches in a two-layer cascaded fabric have as 
many connections to other switches as they do to 
compute nodes. Cascading increases the switch 
costs of a fabric. 
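
The effect of cascading on switch count can be sketched as follows. The construction assumed here is a standard two-layer arrangement in which each first-layer switch devotes half of its ports to nodes and half to uplinks; the function name and the returned dictionary are illustrative, and real vendor configurations may differ.

```python
import math


def two_layer_fabric(nodes, ports_per_switch):
    """Switch counts for a fabric that preserves bisectional bandwidth."""
    if nodes <= ports_per_switch:
        # A single switch suffices; no cascading is needed.
        return {"leaf": 1, "spine": 0}
    down_ports = ports_per_switch // 2      # ports facing compute nodes
    leaf = math.ceil(nodes / down_ports)
    if leaf > ports_per_switch:
        raise ValueError("too many nodes for a two-layer fabric")
    uplinks = leaf * down_ports             # one uplink per node-facing port
    spine = math.ceil(uplinks / ports_per_switch)
    return {"leaf": leaf, "spine": spine}
```

Under these assumptions, a 512 node cluster built from 288-port switches needs four first-layer and two second-layer switches, six in total, whereas 256 nodes fit on a single switch.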

Toroidal gigabit ethernet mesh designs do not 
have this limitation. However, the use of ether- 
net requires custom communications software to 
replace the traditional TCP/IP communications 
protocol; TCP/IP introduces too much latency 
for lattice QCD codes. In contrast, the communications
software supplied with networks such as Myrinet
and Infiniband is not only widely used and robust,
but also requires no modification for lattice QCD.
In terms of reduced
custom software development, significant
benefits may be derived from using popular high 
performance switched networks, even though the 
hardware costs may be greater. 

The term "strong scaling" refers to the decrease 
in time to solve a fixed size problem as additional 
nodes are employed. Communications latencies 
limit strong scaling. As node counts increase, 
the size of the local lattice stored on each node 
decreases, and so the size of the messages used 
to communicate neighboring hyperplanes also de- 
creases. Because of the dispersion of communi- 
cations bandwidth with message size caused by 
latency, the decreasing bandwidth available with 
shorter messages will eventually limit the perfor- 
mance as the number of nodes increases. 

The reliability of the nodes in a cluster will 
limit the length of the longest calculation. Typ- 
ical MTBF figures for commodity computers are 
of order 10^5 to 10^6 hours. For 10^3 nodes, an
MTBF of 10^5 hours will result in an average of
one hardware failure every 100 hours. Operat- 
ing system stability may play a role as well, with 
"mean time between reboots" similarly dictating 
maximum job lengths. This problem can be ad- 
dressed by checkpointing long calculations at reg- 
ular intervals, so that they may be restored at an 
intermediate position after cluster repair. Note 
that switched networks are very tolerant of node 
failure in that a given sublattice may be relocated 
to any available node in the cluster at the start 
of the next job. Mesh networks, on the other 
hand, are generally limited to nearest computer 
neighbor communications unless a large latency 
penalty is incurred. The loss of a node within one 
of the dimensions of a mesh architecture requires 
rewiring to route around the failed computer. 
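
The failure-rate arithmetic above, and its consequence for job length, can be sketched as below. The first function reproduces the estimate in the text. The second assumes independent node failures with exponentially distributed times between failures, which is a modeling assumption added here rather than a statement from the measurements; the function names are illustrative.

```python
import math


def cluster_hours_between_failures(node_mtbf_hours, node_count):
    """Average time between hardware failures anywhere in the cluster."""
    return node_mtbf_hours / node_count


def probability_job_completes(job_hours, node_mtbf_hours, node_count):
    """Chance that no node fails during a job (exponential failure model)."""
    return math.exp(-job_hours * node_count / node_mtbf_hours)
```

For 10^3 nodes with an MTBF of 10^5 hours, the first function gives 100 hours, and under the exponential assumption a 100 hour job spanning the full cluster completes without a failure only about 37% of the time, which motivates checkpointing.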

6. CONCLUSIONS 

Since 1999, PC clusters have exhibited steadily 
improving price/performance for lattice QCD; 
the measured price/performance halving time for
improved staggered codes over this time period
was 1.25 years. Performance trends indicate that 
balanced designs will be achievable on large scale 
clusters in the future. With the advent of PCI-E, 
I/O bus designs will have more than sufficient 
bandwidth to match the communications require- 
ments of many future generations of processors. 
Networks such as Infiniband similarly have ex- 
cess bandwidth today, and vendor roadmaps in- 
dicate performance growth which will pace or ex- 
ceed processor requirements. Improvements in 
memory designs should provide sufficient mem- 
ory bandwidth to balance faster processors. 

To date, the largest clusters in the US specifi- 
cally devoted to lattice QCD have been no larger 
than 256 processors and have been based on 
Myrinet or gigabit mesh networks. Based on 
performance and cost trends, it is clear that 
significant clusters will be constructed in the 
coming years. A 512 processor cluster in 2005 
should sustain 1.9 GFlop/sec per node on the
improved staggered action at less than $1/MFlop
price/performance. By 2006, a cluster with several
thousand processors should sustain multiple
TFlop/sec in aggregate for less than $0.50/MFlop.
Leveraging the results of the widespread use of
commodity clusters, these facilities will require 
neither specialized designs nor operational proce- 
dures. 